<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <title>CUDA Standard Algorithms &raquo; Parallel Scan | Taskflow QuickStart</title>
  <link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Source+Sans+Pro:400,400i,600,600i%7CSource+Code+Pro:400,400i,600" />
  <link rel="stylesheet" href="m-dark+documentation.compiled.css" />
  <link rel="icon" href="favicon.ico" type="image/x-icon" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  <meta name="theme-color" content="#22272e" />
</head>
<body>
<header><nav id="navigation">
  <div class="m-container">
    <div class="m-row">
      <span id="m-navbar-brand" class="m-col-t-8 m-col-m-none m-left-m">
        <a href="https://taskflow.github.io"><img src="taskflow_logo.png" alt="" />Taskflow</a> <span class="m-breadcrumb">|</span> <a href="index.html" class="m-thin">QuickStart</a>
      </span>
      <div class="m-col-t-4 m-hide-m m-text-right m-nopadr">
        <a href="#search" class="m-doc-search-icon" title="Search" onclick="return showSearch()"><svg style="height: 0.9rem;" viewBox="0 0 16 16">
          <path id="m-doc-search-icon-path" d="m6 0c-3.31 0-6 2.69-6 6 0 3.31 2.69 6 6 6 1.49 0 2.85-0.541 3.89-1.44-0.0164 0.338 0.147 0.759 0.5 1.15l3.22 3.79c0.552 0.614 1.45 0.665 2 0.115 0.55-0.55 0.499-1.45-0.115-2l-3.79-3.22c-0.392-0.353-0.812-0.515-1.15-0.5 0.895-1.05 1.44-2.41 1.44-3.89 0-3.31-2.69-6-6-6zm0 1.56a4.44 4.44 0 0 1 4.44 4.44 4.44 4.44 0 0 1-4.44 4.44 4.44 4.44 0 0 1-4.44-4.44 4.44 4.44 0 0 1 4.44-4.44z"/>
        </svg></a>
        <a id="m-navbar-show" href="#navigation" title="Show navigation"></a>
        <a id="m-navbar-hide" href="#" title="Hide navigation"></a>
      </div>
      <div id="m-navbar-collapse" class="m-col-t-12 m-show-m m-col-m-none m-right-m">
        <div class="m-row">
          <ol class="m-col-t-6 m-col-m-none">
            <li><a href="pages.html">Handbook</a></li>
            <li><a href="namespaces.html">Namespaces</a></li>
          </ol>
          <ol class="m-col-t-6 m-col-m-none" start="3">
            <li><a href="annotated.html">Classes</a></li>
            <li><a href="files.html">Files</a></li>
            <li class="m-show-m"><a href="#search" class="m-doc-search-icon" title="Search" onclick="return showSearch()"><svg style="height: 0.9rem;" viewBox="0 0 16 16">
              <use href="#m-doc-search-icon-path" />
            </svg></a></li>
          </ol>
        </div>
      </div>
    </div>
  </div>
</nav></header>
<main><article>
  <div class="m-container m-container-inflatable">
    <div class="m-row">
      <div class="m-col-l-10 m-push-l-1">
        <h1>
          <span class="m-breadcrumb"><a href="cudaStandardAlgorithms.html">CUDA Standard Algorithms</a> &raquo;</span>
          Parallel Scan
        </h1>
        <nav class="m-block m-default">
          <h3>Contents</h3>
          <ul>
            <li><a href="#CUDASTDParallelScanIncludeTheHeader">Include the Header</a></li>
            <li><a href="#CUDASTDWhatIsAScanOperation">What is a Scan Operation?</a></li>
            <li><a href="#CUDASTDScanItems">Scan a Range of Items</a></li>
            <li><a href="#CUDASTDScanTransformedItems">Scan a Range of Transformed Items</a></li>
          </ul>
        </nav>
<p>Taskflow provides standard template methods for scanning a range of items on a CUDA GPU.</p><section id="CUDASTDParallelScanIncludeTheHeader"><h2><a href="#CUDASTDParallelScanIncludeTheHeader">Include the Header</a></h2><p>You need to include the header file, <code>taskflow/cuda/algorithm/scan.hpp</code>, for using the parallel-scan algorithm.</p><pre class="m-code"><span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;taskflow/cuda/algorithm/find.hpp&gt;</span></pre></section><section id="CUDASTDWhatIsAScanOperation"><h2><a href="#CUDASTDWhatIsAScanOperation">What is a Scan Operation?</a></h2><p>A parallel scan task performs the cumulative sum, also known as <em>prefix sum</em> or <em>scan</em>, of the input range and writes the result to the output range. Each element of the output range contains the running total of all earlier elements using the given binary operator for summation.</p><img class="m-image" src="scan.png" alt="Image" /></section><section id="CUDASTDScanItems"><h2><a href="#CUDASTDScanItems">Scan a Range of Items</a></h2><p><a href="namespacetf.html#a2e1b44c84a09e0a8495a611cb9a7ea40" class="m-doc">tf::<wbr />cuda_inclusive_scan</a> computes an inclusive prefix sum operation using the given binary operator over a range of elements specified by <code>[first, last)</code>. The term &quot;inclusive&quot; means that the i-th input element is included in the i-th sum. The following code computes the inclusive prefix sum over an input array and stores the result in an output array.</p><pre class="m-code"><span class="k">const</span><span class="w"> </span><span class="kt">size_t</span><span class="w"> </span><span class="n">N</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1000000</span><span class="p">;</span>
<span class="kt">int</span><span class="o">*</span><span class="w"> </span><span class="n">input</span><span class="w">  </span><span class="o">=</span><span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cuda_malloc_shared</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span><span class="p">(</span><span class="n">N</span><span class="p">);</span><span class="w">  </span><span class="c1">// input  vector</span>
<span class="kt">int</span><span class="o">*</span><span class="w"> </span><span class="n">output</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cuda_malloc_shared</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span><span class="p">(</span><span class="n">N</span><span class="p">);</span><span class="w">  </span><span class="c1">// output vector</span>

<span class="c1">// initializes the data</span>
<span class="k">for</span><span class="p">(</span><span class="kt">size_t</span><span class="w"> </span><span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">&lt;</span><span class="n">N</span><span class="p">;</span><span class="w"> </span><span class="n">input</span><span class="p">[</span><span class="n">i</span><span class="o">++</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rand</span><span class="p">());</span><span class="w"> </span>

<span class="c1">// create an execution policy</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cudaStream</span><span class="w"> </span><span class="n">stream</span><span class="p">;</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cudaDefaultExecutionPolicy</span><span class="w"> </span><span class="nf">policy</span><span class="p">(</span><span class="n">stream</span><span class="p">);</span>

<span class="c1">// queries the required buffer size to scan N elements using the given policy</span>
<span class="k">auto</span><span class="w"> </span><span class="n">bytes</span><span class="w">  </span><span class="o">=</span><span class="w"> </span><span class="n">policy</span><span class="p">.</span><span class="n">scan_bufsz</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span><span class="p">(</span><span class="n">N</span><span class="p">);</span>
<span class="k">auto</span><span class="w"> </span><span class="n">buffer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cuda_malloc_device</span><span class="o">&lt;</span><span class="n">std</span><span class="o">::</span><span class="n">byte</span><span class="o">&gt;</span><span class="p">(</span><span class="n">bytes</span><span class="p">);</span>

<span class="c1">// computes inclusive scan over input and stores the result in output</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cuda_inclusive_scan</span><span class="p">(</span><span class="n">policy</span><span class="p">,</span><span class="w"> </span>
<span class="w">  </span><span class="n">input</span><span class="p">,</span><span class="w"> </span><span class="n">input</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">output</span><span class="p">,</span><span class="w"> </span><span class="p">[]</span><span class="w"> </span><span class="n">__device__</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">a</span><span class="p">,</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">b</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="k">return</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">b</span><span class="p">;},</span><span class="w"> </span><span class="n">buffer</span>
<span class="p">);</span>

<span class="c1">// synchronizes and verifies the result</span>
<span class="n">stream</span><span class="p">.</span><span class="n">synchronize</span><span class="p">();</span>

<span class="k">for</span><span class="p">(</span><span class="kt">size_t</span><span class="w"> </span><span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">&lt;</span><span class="n">N</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w">  </span><span class="n">assert</span><span class="p">(</span><span class="n">output</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">output</span><span class="p">[</span><span class="n">i</span><span class="mi">-1</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">input</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
<span class="p">}</span>

<span class="c1">// delete the device memory</span>
<span class="n">cudaFree</span><span class="p">(</span><span class="n">input</span><span class="p">);</span>
<span class="n">cudaFree</span><span class="p">(</span><span class="n">output</span><span class="p">);</span>
<span class="n">cudaFree</span><span class="p">(</span><span class="n">buffer</span><span class="p">);</span></pre><p>The scan algorithm runs <em>asynchronously</em> through the stream specified in the execution policy. You need to synchronize the stream to obtain correct results. Since the GPU scan algorithm may require extra buffer to store the temporary results, you need to provide a buffer of size at least larger or equal to the value returned from <code><a href="classtf_1_1cudaExecutionPolicy.html#af25648b3269902b333cfcd58665005e8" class="m-doc">tf::<wbr />cudaDefaultExecutionPolicy::<wbr />scan_bufsz</a></code>.</p><aside class="m-note m-warning"><h4>Attention</h4><p>You must keep the buffer alive before the scan call completes.</p></aside><p>On the other hand, <a href="namespacetf.html#aeb391c40120844318fd715b8c3a716bb" class="m-doc">tf::<wbr />cuda_exclusive_scan</a> computes an exclusive prefix sum operation. The term &quot;exclusive&quot; means that the i-th input element is <em>NOT</em> included in the i-th sum.</p><pre class="m-code"><span class="c1">// computes exclusive scan over input and stores the result in output</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cuda_exclusive_scan</span><span class="p">(</span><span class="n">policy</span><span class="p">,</span><span class="w"> </span>
<span class="w">  </span><span class="n">input</span><span class="p">,</span><span class="w"> </span><span class="n">input</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">output</span><span class="p">,</span><span class="w"> </span><span class="p">[]</span><span class="w"> </span><span class="n">__device__</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">a</span><span class="p">,</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">b</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="k">return</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">b</span><span class="p">;},</span><span class="w"> </span><span class="n">buffer</span>
<span class="p">);</span>

<span class="c1">// synchronizes the execution and verifies the result</span>
<span class="n">stream</span><span class="p">.</span><span class="n">synchronize</span><span class="p">();</span>
<span class="k">for</span><span class="p">(</span><span class="kt">size_t</span><span class="w"> </span><span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">&lt;</span><span class="n">N</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w">  </span><span class="n">assert</span><span class="p">(</span><span class="n">output</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">output</span><span class="p">[</span><span class="n">i</span><span class="mi">-1</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">input</span><span class="p">[</span><span class="n">i</span><span class="mi">-1</span><span class="p">]);</span>
<span class="p">}</span></pre></section><section id="CUDASTDScanTransformedItems"><h2><a href="#CUDASTDScanTransformedItems">Scan a Range of Transformed Items</a></h2><p><a href="namespacetf.html#afa4aa760ddb6efbda1b9bab505ad5baf" class="m-doc">tf::<wbr />cuda_transform_inclusive_scan</a> transforms each item in the range <code>[first, last)</code> and computes an inclusive prefix sum over these transformed items. The following code multiplies each item by 10 and then compute the inclusive prefix sum over 1000000 transformed items.</p><pre class="m-code"><span class="k">const</span><span class="w"> </span><span class="kt">size_t</span><span class="w"> </span><span class="n">N</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1000000</span><span class="p">;</span>
<span class="kt">int</span><span class="o">*</span><span class="w"> </span><span class="n">input</span><span class="w">  </span><span class="o">=</span><span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cuda_malloc_shared</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span><span class="p">(</span><span class="n">N</span><span class="p">);</span><span class="w">  </span><span class="c1">// input  vector</span>
<span class="kt">int</span><span class="o">*</span><span class="w"> </span><span class="n">output</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cuda_malloc_shared</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span><span class="p">(</span><span class="n">N</span><span class="p">);</span><span class="w">  </span><span class="c1">// output vector</span>

<span class="c1">// initializes the data</span>
<span class="k">for</span><span class="p">(</span><span class="kt">size_t</span><span class="w"> </span><span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">&lt;</span><span class="n">N</span><span class="p">;</span><span class="w"> </span><span class="n">input</span><span class="p">[</span><span class="n">i</span><span class="o">++</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rand</span><span class="p">());</span><span class="w"> </span>

<span class="c1">// create an execution policy</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cudaStream</span><span class="w"> </span><span class="n">stream</span><span class="p">;</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cudaDefaultExecutionPolicy</span><span class="w"> </span><span class="nf">policy</span><span class="p">(</span><span class="n">stream</span><span class="p">);</span>

<span class="c1">// queries the required buffer size to scan N elements using the given policy</span>
<span class="k">auto</span><span class="w"> </span><span class="n">bytes</span><span class="w">  </span><span class="o">=</span><span class="w"> </span><span class="n">policy</span><span class="p">.</span><span class="n">scan_bufsz</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span><span class="p">(</span><span class="n">N</span><span class="p">);</span>
<span class="k">auto</span><span class="w"> </span><span class="n">buffer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cuda_malloc_device</span><span class="o">&lt;</span><span class="n">std</span><span class="o">::</span><span class="n">byte</span><span class="o">&gt;</span><span class="p">(</span><span class="n">bytes</span><span class="p">);</span>

<span class="c1">// computes inclusive scan over transformed input and stores the result in output</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cuda_transform_inclusive_scan</span><span class="p">(</span><span class="n">policy</span><span class="p">,</span><span class="w"> </span>
<span class="w">  </span><span class="n">input</span><span class="p">,</span><span class="w"> </span><span class="n">input</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">output</span><span class="p">,</span><span class="w"> </span>
<span class="w">  </span><span class="p">[]</span><span class="w"> </span><span class="n">__device__</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">a</span><span class="p">,</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">b</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">b</span><span class="p">;</span><span class="w"> </span><span class="p">},</span><span class="w">  </span><span class="c1">// binary scan operator</span>
<span class="w">  </span><span class="p">[]</span><span class="w"> </span><span class="n">__device__</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">a</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">a</span><span class="o">*</span><span class="mi">10</span><span class="p">;</span><span class="w"> </span><span class="p">},</span><span class="w">          </span><span class="c1">// unary transform operator</span>
<span class="w">  </span><span class="n">buffer</span>
<span class="p">);</span>

<span class="c1">// wait for the scan to complete</span>
<span class="n">stream</span><span class="p">.</span><span class="n">synchronize</span><span class="p">();</span>

<span class="c1">// verifies the result</span>
<span class="k">for</span><span class="p">(</span><span class="kt">size_t</span><span class="w"> </span><span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">&lt;</span><span class="n">N</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w">  </span><span class="n">assert</span><span class="p">(</span><span class="n">output</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">output</span><span class="p">[</span><span class="n">i</span><span class="mi">-1</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">input</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mi">10</span><span class="p">);</span>
<span class="p">}</span>

<span class="c1">// delete the device memory</span>
<span class="n">cudaFree</span><span class="p">(</span><span class="n">input</span><span class="p">);</span>
<span class="n">cudaFree</span><span class="p">(</span><span class="n">output</span><span class="p">);</span>
<span class="n">cudaFree</span><span class="p">(</span><span class="n">buffer</span><span class="p">);</span></pre><p>Similarly, <a href="namespacetf.html#a2e739895c1c73538967af060ca714366" class="m-doc">tf::<wbr />cuda_transform_exclusive_scan</a> performs an exclusive prefix sum over a range of transformed items. The following code computes the exclusive prefix sum over 1000000 transformed items each multiplied by 10.</p><pre class="m-code"><span class="k">const</span><span class="w"> </span><span class="kt">size_t</span><span class="w"> </span><span class="n">N</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1000000</span><span class="p">;</span>
<span class="kt">int</span><span class="o">*</span><span class="w"> </span><span class="n">input</span><span class="w">  </span><span class="o">=</span><span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cuda_malloc_shared</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span><span class="p">(</span><span class="n">N</span><span class="p">);</span><span class="w">  </span><span class="c1">// input  vector</span>
<span class="kt">int</span><span class="o">*</span><span class="w"> </span><span class="n">output</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cuda_malloc_shared</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span><span class="p">(</span><span class="n">N</span><span class="p">);</span><span class="w">  </span><span class="c1">// output vector</span>

<span class="c1">// initializes the data</span>
<span class="k">for</span><span class="p">(</span><span class="kt">size_t</span><span class="w"> </span><span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">&lt;</span><span class="n">N</span><span class="p">;</span><span class="w"> </span><span class="n">input</span><span class="p">[</span><span class="n">i</span><span class="o">++</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rand</span><span class="p">());</span><span class="w"> </span>

<span class="c1">// create an execution policy</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cudaStream</span><span class="w"> </span><span class="n">stream</span><span class="p">;</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cudaDefaultExecutionPolicy</span><span class="w"> </span><span class="nf">policy</span><span class="p">(</span><span class="n">stream</span><span class="p">);</span>

<span class="c1">// queries the required buffer size to scan N elements using the given policy</span>
<span class="k">auto</span><span class="w"> </span><span class="n">bytes</span><span class="w">  </span><span class="o">=</span><span class="w"> </span><span class="n">policy</span><span class="p">.</span><span class="n">scan_bufsz</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span><span class="p">(</span><span class="n">N</span><span class="p">);</span>
<span class="k">auto</span><span class="w"> </span><span class="n">buffer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cuda_malloc_device</span><span class="o">&lt;</span><span class="n">std</span><span class="o">::</span><span class="n">byte</span><span class="o">&gt;</span><span class="p">(</span><span class="n">bytes</span><span class="p">);</span>

<span class="c1">// computes exclusive scan over transformed input and stores the result in output</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cuda_transform_exclusive_scan</span><span class="p">(</span><span class="n">policy</span><span class="p">,</span><span class="w"> </span>
<span class="w">  </span><span class="n">input</span><span class="p">,</span><span class="w"> </span><span class="n">input</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">output</span><span class="p">,</span><span class="w"> </span>
<span class="w">  </span><span class="p">[]</span><span class="w"> </span><span class="n">__device__</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">a</span><span class="p">,</span><span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">b</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">b</span><span class="p">;</span><span class="w"> </span><span class="p">},</span><span class="w">  </span><span class="c1">// binary scan operator</span>
<span class="w">  </span><span class="p">[]</span><span class="w"> </span><span class="n">__device__</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">a</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">a</span><span class="o">*</span><span class="mi">10</span><span class="p">;</span><span class="w"> </span><span class="p">},</span><span class="w">          </span><span class="c1">// unary transform operator</span>
<span class="w">  </span><span class="n">buffer</span>
<span class="p">);</span>

<span class="c1">// wait for the scan to complete</span>
<span class="n">stream</span><span class="p">.</span><span class="n">synchronize</span><span class="p">();</span>

<span class="c1">// verifies the result</span>
<span class="k">for</span><span class="p">(</span><span class="kt">size_t</span><span class="w"> </span><span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">&lt;</span><span class="n">N</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w">  </span><span class="n">assert</span><span class="p">(</span><span class="n">output</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">output</span><span class="p">[</span><span class="n">i</span><span class="mi">-1</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">input</span><span class="p">[</span><span class="n">i</span><span class="mi">-1</span><span class="p">]</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mi">10</span><span class="p">);</span>
<span class="p">}</span>

<span class="c1">// delete the device memory</span>
<span class="n">cudaFree</span><span class="p">(</span><span class="n">input</span><span class="p">);</span>
<span class="n">cudaFree</span><span class="p">(</span><span class="n">output</span><span class="p">);</span>
<span class="n">cudaFree</span><span class="p">(</span><span class="n">buffer</span><span class="p">);</span></pre></section>
      </div>
    </div>
  </div>
</article></main>
<div class="m-doc-search" id="search">
  <a href="#!" onclick="return hideSearch()"></a>
  <div class="m-container">
    <div class="m-row">
      <div class="m-col-m-8 m-push-m-2">
        <div class="m-doc-search-header m-text m-small">
          <div><span class="m-label m-default">Tab</span> / <span class="m-label m-default">T</span> to search, <span class="m-label m-default">Esc</span> to close</div>
          <div id="search-symbolcount">&hellip;</div>
        </div>
        <div class="m-doc-search-content">
          <form>
            <input type="search" name="q" id="search-input" placeholder="Loading &hellip;" disabled="disabled" autofocus="autofocus" autocomplete="off" spellcheck="false" />
          </form>
          <noscript class="m-text m-danger m-text-center">Unlike everything else in the docs, the search functionality <em>requires</em> JavaScript.</noscript>
          <div id="search-help" class="m-text m-dim m-text-center">
            <p class="m-noindent">Search for symbols, directories, files, pages or
            modules. You can omit any prefix from the symbol or file path; adding a
            <code>:</code> or <code>/</code> suffix lists all members of given symbol or
            directory.</p>
            <p class="m-noindent">Use <span class="m-label m-dim">&darr;</span>
            / <span class="m-label m-dim">&uarr;</span> to navigate through the list,
            <span class="m-label m-dim">Enter</span> to go.
            <span class="m-label m-dim">Tab</span> autocompletes common prefix, you can
            copy a link to the result using <span class="m-label m-dim">⌘</span>
            <span class="m-label m-dim">L</span> while <span class="m-label m-dim">⌘</span>
            <span class="m-label m-dim">M</span> produces a Markdown link.</p>
          </div>
          <div id="search-notfound" class="m-text m-warning m-text-center">Sorry, nothing was found.</div>
          <ul id="search-results"></ul>
        </div>
      </div>
    </div>
  </div>
</div>
<script src="search-v2.js"></script>
<script src="searchdata-v2.js" async="async"></script>
<footer><nav>
  <div class="m-container">
    <div class="m-row">
      <div class="m-col-l-10 m-push-l-1">
        <p>Taskflow handbook is part of the <a href="https://taskflow.github.io">Taskflow project</a>, copyright © <a href="https://tsung-wei-huang.github.io/">Dr. Tsung-Wei Huang</a>, 2018&ndash;2024.<br />Generated by <a href="https://doxygen.org/">Doxygen</a> 1.9.6 and <a href="https://mcss.mosra.cz/">m.css</a>.</p>
      </div>
    </div>
  </div>
</nav></footer>
</body>
</html>
